REPORT RESUMES

ED 012 922    AL 000 038

ENGLISH LANGUAGE PROFICIENCY TESTING AND THE INDIVIDUAL. 

BY- HOLTZMAN, PAUL D. 

PENNSYLVANIA STATE UNIV., UNIVERSITY PARK

PUB DATE 27 APR 67 

EDRS PRICE MF-$0.25 HC-$0.60 15P. 



DESCRIPTORS- *LANGUAGE TESTS, *FOREIGN STUDENTS, TESOL, TEST
VALIDITY, TEST RESULTS, TEST INTERPRETATION, STUDENT TESTING,
TESTING PROBLEMS, LANGUAGE ABILITY, SECOND LANGUAGE LEARNING,
*DATA ANALYSIS, *FACTOR ANALYSIS



THE AUTHOR POINTS OUT PROBLEMS IN TEST RESEARCH AND
INTERPRETATION, SOME OF WHICH ARE DUE TO CONFLICTS BETWEEN
THE FINDINGS OF THE DATA ANALYST WHO IS RESTRICTED TO BASING
HIS DECISIONS ON SELECTED DATA ONLY, AND THE TEST INTERPRETER
WHO IS AWARE OF VARIABLE VALIDITIES OF SUCH UNTESTED FACTORS
AS SITUATIONAL ANXIETY, PERSONALITY, MOTHER-TONGUE
INFLUENCES, CULTURAL CLASH, AND SENSE OF COMMUNICATION.
HOWEVER, THE AUTHOR FEELS IN SPITE OF THESE AND OTHER
SHORTCOMINGS, THERE ARE A NUMBER OF REASONS FOR CONTINUING TO
DO FACTOR ANALYSIS OF TEST RESULTS. ONE FACTOR,
"FEEDFORWARD," BASED ON THE PSYCHOLOGY OF PERCEPTUAL
EXPECTANCE, DEALS WITH SETS OF THE CATEGORIES THAT
INDIVIDUALS HAVE AVAILABLE FOR THE PROCESSING OF ANY INTERNAL
AND EXTERNAL PERCEPTIONS INCLUDING THOSE FOR LANGUAGE
RECEPTION AND PRODUCTION. A VALID TEST OF LANGUAGE
PROFICIENCY WOULD BE A TEST OF THE CATEGORIES THAT THE
SUBJECT BRINGS TO ANY PROCESSING OF THE LANGUAGE. THE AUTHOR
REVIEWS RECENT AND CURRENT RESEARCH WHICH IS CONCERNED WITH
THE FACTOR OF "REDUNDANCY UTILIZATION," THE ABILITY OF THE
NATIVE SPEAKER TO PREDICT SEQUENTIAL LANGUAGE SIGNALS AS
CONTRASTED WITH THE NON-NATIVE SPEAKER'S DEPENDENCY ON
INTERPRETING EACH WORD ON THE BASIS OF THE SIGNAL ITSELF.
THIS WORKPAPER WAS PRESENTED AT THE ATESL SEMINAR IN AUSTIN,
TEXAS, APRIL 27, 1967. (AM)






ED012922 



ENGLISH LANGUAGE PROFICIENCY TESTING
AND THE INDIVIDUAL

U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE
OFFICE OF EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM THE
PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE OF EDUCATION
POSITION OR POLICY.



A workpaper for the ATESL Seminar
on testing

Houston, Texas
April 27, 1967



by Paul D. Holtzman






THE PENNSYLVANIA STATE UNIVERSITY
Graduate School Language Testing Center 
for International Students 
University Park, Pennsylvania 







We might parody thus: A test interpreter looking at a set of ELP scores
exclaimed that Mr. Kashimura will do very well in his petroleum engineering
curriculum. His assistant asked, "How can you tell?" "Because he is obviously highly
motivated."

The Educational Testing Service, as a corporate data analyst, carefully avoids
statements of criteria levels. The Pennsylvania Department of Public Instruction
has contracted with ETS to test candidates for certification as secondary school
language teachers. As corporate test interpreter, the DPI established a criterion
score for certification. It was inevitable that a student would come along with a
record of long and, by other criteria, successful study in a language and with an
exceptionally effective practice teaching experience who scored just one point below
the required level on the ETS-MLA test. Test interpreters in our College of Education
are up in arms. They perceive the other data but, by law, the DPI Bureau of
Certification cannot.

B. The data analyst -- so long as he is only the data analyst -- is never aware
of the human consequences of his decisions. Any sensitive test interpreter cannot
escape this knowledge.

In the case just cited (and it is a real case) the test interpreters are upset
by a host of human consequences: effects upon the would-be teacher and a real loss
to students somewhere exposed to a less effective teacher who may have scored just
above the criterion.

Confrontation of human consequences of test score interpretations is an almost
daily occurrence in my office. Mr. Kuo was one of five graduate students in Chemistry
who were told that they must achieve minimum English proficiency by the end of their
first term. If not, their assistantships would not be renewed the following year.
End-term test results showed that the other four had achieved "minimum proficiency"
or better, that Mr. Kuo had not, and that Mr. Kuo had made more progress than the
other four. But to him the others had succeeded where he had failed, and this made
it almost impossible for him to face his Department.

Mr. Bolivar was a perennial student of English. After two years he had completed
all of the requirements for a master's degree in petroleum engineering but
had not achieved the "minimum proficiency" required of all candidates by the Graduate
School -- at least not according to the test scores. When, on the basis of human
consequences, it was reported that he had at last met the requirement, even the
foreign student adviser was critical -- of the ELP test.

On the other hand, there was Mr. Pak. He completed his courses, left the
University, and submitted a thesis from somewhere in the world. For several terms
his name was removed from the graduation list because there was no evidence of his
meeting the ELP requirement. At last he took the TOEFL in Japan. For several days
past the printer's deadline the Graduate School held the graduation list. At last
the TOEFL score arrived: about 330. The ensuing conversations between Mr. Pak's
Department and the Dean of the Graduate School no doubt dealt with human consequences.
The degree was awarded.

I am sure that all members of this seminar can match me story for story
except, possibly, those who meet my definition of "data analyst."

C. The data analyst, given adequate reliability, bases his decisions on
assumed validity. The test interpreter is aware of variable validities attributable
to untested factors such as situational anxiety, personality, mother-tongue
influences, cultural clash, and sense of communication.

Consider, for instance, the statistical process of item analysis. Under the
heading of "discrimination" what do we look for? The extent to which each item
discriminates between those who do well and those who do poorly on the total test.
Thus, in essence, we seek to improve reliability, assuming validity for total test
scores. (See also Problem 3.)
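
The discrimination index just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the six response records are hypothetical, and the index used here (the correlation of each item with the total score on the remaining items) is one common form of item analysis, not necessarily the one applied to the Penn State tests. Note that the total score is taken as the yardstick throughout, which is exactly the assumption of validity in question.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_discrimination(responses):
    """For each item, correlate right/wrong (1/0) with the score on all other items."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    indices = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [total - score for total, score in zip(totals, item)]
        indices.append(pearson(item, rest))
    return indices


# Hypothetical data: one row per examinee, one column per item.
RESPONSES = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

discrimination = item_discrimination(RESPONSES)
for number, value in enumerate(discrimination, start=1):
    print(f"item {number}: {value:+.2f}")
```

In this invented data the fourth item is answered correctly mainly by examinees who do poorly elsewhere, so its index is negative and an item analyst would drop it. Whether that item was measuring something the total score misses is a question the index cannot answer.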




Teachers of foreign languages in our schools and colleges have long been aware
of the fact that their students who may be academically equal, according to tests,
will vary over a wide range of abilities in using the foreign language to
communicate with natives. Yet it is this ability to communicate in the language, rather
than knowledge of the language, that we are trying to test. As far as I know there
is no dependable test of this ability for second language learners. We are just
beginning to develop one for American students.[3]

D. The data analyst is engaged in transmission of data; the test interpreter 
is engaged in human interaction. 

In the transmission of data, meaning is irrelevant so long as the data received 
are the same as the data sent. In a communication transaction, meanings are more 
important than the data. Test interpretation is, obviously, a communication trans- 
action. The test interpreter, then, must take into account the meanings of what he 
has to report to the student, to his adviser, to his department. 

Problem 2. The goal of ELP test research seems to be one of devising means of
validating all ELP decisions on the basis of data analysis. Yet validation is
always statistical. This means that we can only provide valid descriptions of groups
and can never account for all variances from derived statistical norms.

We find that a test has high reliability but never perfect reliability. What
does this mean? It means that for some individuals the test lacks consistency. We
validate reliable tests by determining correlations between test scores and those
derived from some criterion measure (hopefully also reliable). The correlation is
never perfect. Darrell Huff reminds us to "keep in mind that a correlation may be
real and based on real cause and effect -- and still be almost worthless in
determining action in any single case."[4]

[2] See for instance Kenneth L. Pike, "Nucleation," The Modern Language Journal,
November, 1960, 291-295. Reprinted in Harold B. Allen, Teaching English as a Second
Language, McGraw-Hill, 1965, pp. 67-74.

[3] Called a "Ccomsense Inventory," it attempts to test the extent to which
students are audience-centered in their concepts of speaking and writing.
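
Huff's warning can be put in numbers. The sketch below is illustrative only: the validity coefficient of .60 and the criterion standard deviation of 10 points are assumed values, not figures from any ELP study. It uses the standard formula for the standard error of estimate, the criterion standard deviation multiplied by the square root of one minus the squared correlation.

```python
from math import sqrt


def variance_explained(r):
    """Proportion of criterion variance associated with the test (r squared)."""
    return r * r


def standard_error_of_estimate(sd_criterion, r):
    """Typical size of the error when predicting one person's criterion score."""
    return sd_criterion * sqrt(1.0 - r * r)


# Assumed, illustrative values.
R_VALIDITY = 0.60
SD_CRITERION = 10.0

explained = variance_explained(R_VALIDITY)
see = standard_error_of_estimate(SD_CRITERION, R_VALIDITY)
print(f"variance explained: {explained:.2f}")
print(f"standard error of estimate: {see:.1f} points (criterion SD {SD_CRITERION:.0f})")
```

Under these assumptions the test accounts for about a third of the criterion variance, yet the error band around a single student's predicted score is four-fifths as wide as it would be with no test at all. The correlation describes the group well and the individual poorly.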

Problem 3. Lines of research are subject to the common criticism of any
detective work: "They look everywhere until they find a suspect, but they're likely
to concentrate on him from then on."[5] To throw in another McLuhan aphorism, "As we
begin, so shall we go."[6]

Again, our statistical methods help us (or force us) to concentrate on "a
suspect" or test variable. One already mentioned is item analysis which we use to
increase not the validity but the reliability of our tests, to improve
"concentration" on the test variable.

A popular statistical method that does not inherently force such concentration
but may be used to do so is factor analysis. Having arrived at the conclusion that
we had isolated a general ELP factor,[7] we have eliminated tests which did not "load"
on that factor. Yet the outcome of a factor analysis is determined by what is put in.
From our own research, here are some examples which have not been previously reported.

First, Table I shows the kind of concentration that accrued when we proceeded
from the results reported earlier[8] to refined tests and a new factor analysis.
These are data derived on the Penn State ELP test from Indiana University students.

[4] How to Lie with Statistics, W. W. Norton, 1954, p. 93.

[5] From Harry Kemelman, Friday the Rabbi Slept Late, Fawcett Crest, 1965,
p. 114.

[6] The Medium Is the Massage, p. 45.

[7] See Richard E. Spencer and Paul D. Holtzman, "It's Composition -- But Is It
Reliable?" College Composition and Communication, May, 1965, pp. 117-121.

[8] Ibid.




                              TABLE I

                    FACTOR ANALYSIS SHOWING ONLY
                    HIGHLY SIGNIFICANT LOADINGS

Test                            I       II      III     IV      V

 1. Sound discrimination        .51                     .46     .48
 2. Accuracy of dictation                                       .97
 3. Written structure                           .86
 4. Attitude toward English                     .73
 5. Word fluency                .76
 6. Paragraph reading           .82             .42
 7. Scrambled text                                      .92
 8. Vocabulary                          .60     .53
 9. Rated intelligibility               .95
10. Rated listening ability     .87
11. Oral stress                         .95

% of variance accounted for     40.5    14.2    12.4    10.9    9.6



Within the process of factor analysis is another source of perception control:
the labeling of factors. In earlier studies we found a general ELP factor plus
others which we labeled "academic ability or intelligence" and "attitude toward
English." The reader might try his hand at labeling the five factors above. It is
a dangerous practice.

What should be noted in Table I is that, whatever the factors, they seem to
account for a large portion of the variance and they seem to be getting at
significant aspects of English language proficiency.

By contrast, however, Table II shows what happens if we introduce more test
variables, all presumably designed to get at the same abilities assessed by the
Penn State ELP subtests. These data are from the same subjects (Indiana University)
and include the data which produced Table I.




                                  TABLE II

                        FACTOR ANALYSIS SHOWING ONLY
                        HIGHLY SIGNIFICANT LOADINGS

Test                                      I, II         III     IV      V

 1. Michigan test of aural comprehension  .45                   -.71
 2. Michigan test (total)                 .44, -.53             -.62
 3. TOEFL listening comprehension         .99
 4. TOEFL structure                       1.05
 5. TOEFL vocabulary                                    -.60
 6. TOEFL reading comprehension           -.73
 7. TOEFL writing ability                 .47           -.54    .73
 8. TOEFL total                           .47           -.40
 9. Listening and speaking*               -.63
10. Initiation and conversation*          -.72
11. Interest and motivation*              -.73
12. Performance-phonology*                              -.66
13. Performance-structure*                -.52, -.44    -.44
14. Aural comprehension*                                                .99
15. Initiation and conversation                         -.89
16. Interest and motivation**             .76
17. Performance-writing**                               -.66
18. Performance-longer writing**                                        .?4
19. PSU sound discrimination                            -.44    -.52    .49
20. PSU accuracy of dictation             .43           -.49
21. PSU written structure                 -.67
22. PSU attitude toward English           -.43
23. PSU word fluency                                    -.47    .43
24. PSU paragraph reading                                       -.75
25. PSU scrambled text                                                  .58
26. PSU vocabulary
27. PSU rated intelligibility                           .71
28. PSU rated listening ability                                         .42
29. PSU oral stress

% of variance accounted for               6.21, 4.61    3.49    2.93    2.77

*instructor ratings -- classes in spoken English
**instructor ratings -- classes in written English

(Factors I and II share one column in this copy; a single entry in that column
may belong to either factor.)

Table II is presented for no purpose other than to confirm the idea that
factor analysis results are a function of the data submitted or, as someone has put
it more succinctly: "Garbage in -- garbage out." Little of the variance is accounted
for. Basically similar tests load on different factors. What this tells us is that
we are dealing with such complexities of interacting variables as to challenge
reduction to simple scores and assessment on the basis of those scores alone.




Further, it seems clear that concentration on one variable or set of variables
to the exclusion of others in test development and application is fraught with
dangers.
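
The dependence of the "% of variance accounted for" on what is submitted can be shown with a toy case. The sketch below is an assumption-laden stand-in, not the analysis behind Tables I and II: it uses the largest eigenvalue of a made-up correlation matrix (a principal-components shortcut, found here by power iteration) in place of a full factor analysis, and all function names and correlation values are hypothetical. Because the diagonal of a correlation matrix sums to the number of variables, the largest eigenvalue divided by that number is the share of total variance carried by the first component.

```python
def largest_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric matrix by power iteration."""
    n = len(matrix)
    vector = [1.0] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    # Rayleigh quotient; the vector already has unit length.
    return sum(p * v for p, v in zip(product, vector))


def correlation_matrix(n_related, n_unrelated, r):
    """Hypothetical battery: n_related tests intercorrelate at r, the rest at 0."""
    n = n_related + n_unrelated
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 1.0
            elif i < n_related and j < n_related:
                matrix[i][j] = r
    return matrix


def first_component_share(matrix):
    """Share of total variance carried by the first principal component."""
    return largest_eigenvalue(matrix) / len(matrix)


focused = first_component_share(correlation_matrix(3, 0, 0.8))
diluted = first_component_share(correlation_matrix(3, 3, 0.8))
print(f"three related tests only:       {focused:.1%}")
print(f"same three plus three unrelated: {diluted:.1%}")
```

The three related tests are the same in both runs and so is the dimension they share; only the company they keep has changed, and the apparent strength of the "general factor" is halved.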

II. Perceptual expectancy and the search for an integrative factor 

In spite of the results cited above, there are a number of reasons for continuing
to do factor analysis of test results. Some of these have to do with refinement
of the tests themselves; some have to do with diagnosis on the basis of more or less
independent factors; some have to do with teaching. If two skills are closely
related, if they are co-incident, must each be taught? The teacher is always looking
for integrative factors -- for the skills whose development automatically increases
abilities in other skills. In our program, for instance, we have found that emphasis
on the learning of unstressing patterns and rhythms obviates the necessity of
drilling on certain phonemes. The unique unstressing behaviors of English speakers
are, we believe, an integrative factor.

A far more important integrative factor -- of significance in both testing and
teaching -- seems to be ignored except in a few recent research efforts. This is what
I. A. Richards might call a feedforward factor. It is based in the psychology of
perceptual expectancy. It deals with sets of the categories that individuals have
available for the processing of any internal and external perceptions including
those for language reception and production.

"It's all Greek to me" is a statement that anything that does not fit in the 
category "English" is perceived in the category "Greek;" or that any English that I 
don't understand might as well be filed with Greek which I also don't understand. 

When a speaker of the General American dialect visits parts of Texas he finds that
his expectancy for the first person singular pronoun is often violated. Since he
cannot "file" it under /aI/ and he does not have the category /a:/, he "files" it
under /a/. He is helped in this, of course, by the fiction writers who spell it Ah.




Learning a new language is a process of structuring new, complex sets of
perceptual expectancies -- both for reception and for production of the language.
It might follow, then, that a valid test of language proficiency would be a test of
the categories that the subject brings to any processing of the language. These
categories necessarily include expectancies for not only words or vocabulary but
for all of the variables that we have been trying to test. They would seem to
comprise an integrative language factor.

Most of our tests, however, force the student to respond with the examiner's
categories rather than his own; with the test-writer's words and sounds and
structures whether or not they are also the student's. The multiple choice item is a
case in point. The language of response is chosen from the limited categories drawn
from the perceptual possibilities projected by the test constructor.

Can we assess the appropriateness of a language-learner's categories for
language perception? The question can't be answered, of course. But several lines
of research suggest that the answer might be yes, within limits.

All of the research reviewed below is concerned, in one way or another, with
a factor of "redundancy utilization."[9] In normal (first language) function we know
what to expect -- and therefore what categories to have available -- on the basis of
redundancy in the language, in the situation, in the "image" of the speaker or
listener or writer or reader, in the context, and so on. We know what to expect.
We use the language signal as best we can to confirm or deny the expectancy.

Alan C. Nichols[10] has found that American and foreign students have different
patterns of error in writing dictated sentences categorized as having "high
naturalness." For successive words, native speakers made increasing errors toward
the middle of each sentence and fewer errors toward the end. The foreign students
(Japanese) made errors at about the same rate in all parts of each sentence. It
seems appropriate to conclude that the native speakers were able to make use of the
redundancy of the "highly natural" sentences to predict with increasing accuracy
what would follow. The Japanese students were less able to predict and more
dependent on interpreting each word on the basis of the signal itself.

[9] From Wendell W. Weaver and Albert J. Kingston, "A factor analysis of the
Cloze procedure and other measures of reading and language ability," Journal of
Communication, XIII:4, 1963, pp. 252-261.
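
A tally of dictation errors by position in the sentence, of the kind the Nichols finding rests on, might be sketched as follows. This is not Nichols' procedure: the function name, the division of each sentence into thirds, and the sample sentence are all assumptions made for illustration, and the sketch assumes the written response lines up word for word with the dictated sentence (substitutions only, no omitted or inserted words).

```python
def error_rate_by_position(dictated, written, bins=3):
    """Proportion of miswritten words in each successive part of the sentence."""
    errors = [0] * bins
    counts = [0] * bins
    for target, response in zip(dictated, written):
        target_words = target.lower().split()
        response_words = response.lower().split()
        for i, word in enumerate(target_words):
            part = min(i * bins // len(target_words), bins - 1)
            counts[part] += 1
            # A missing or different word at this position counts as an error.
            if i >= len(response_words) or response_words[i] != word:
                errors[part] += 1
    return [e / c if c else 0.0 for e, c in zip(errors, counts)]


# Hypothetical dictated sentence and one hypothetical written response.
DICTATED = ["the boy ran to the store"]
WRITTEN = ["the boy run to the store"]

rates = error_rate_by_position(DICTATED, WRITTEN)
print(rates)  # error proportion for the beginning, middle, and end thirds
```

Pooled over many sentences and subjects, a profile that peaks in the middle and falls toward the end would correspond to the native-speaker pattern reported above, and a flat profile to the pattern reported for the Japanese students.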

In later research, Nichols administered his test of Memory Span for Immediate 
Recall (MSIR) along with the Penn State tests. MSIR scores correlated most highly 
with those subtests which might seem to require some redundancy utilization and 
with all but one of the subtests that require the subject to use his own language. 

                 TABLE III

          RANK ORDER CORRELATIONS
   BETWEEN MSIR AND PENN STATE SUBTESTS

Rated Intelligibility          .64
Rated Listening Ability        .64
Dictation                      .58
Sentence Completion            .57



I have taken the term "redundancy utilization" from the work of Weaver and
Kingston[11] who compared the MLAT, a battery of comprehension and vocabulary tests,
and Cloze procedure in a factor analysis. Having put Cloze procedure in, they got
Cloze procedure out as a factor and then concluded that it was perhaps little more
than "an interesting curiosity." Carroll and others[12] conducted a "pilot
investigation" of the applicability of Cloze procedure in testing foreign language
achievement.

[10] "Apparent factors leading to errors in audition made by foreign students,"
Speech Monographs, XXXI:1, March, 1964, pp. 85-91.

[11] op. cit.

[12] John B. Carroll, Aaron S. Carton, and Claudia P. Wilds, An Investigation
of "Cloze" Items in the Measurement of Achievement in Foreign Languages, Laboratory
for Research in Instruction, Graduate School of Education, Harvard University, 1959.

Some of their conclusions include:



1. The fact that cloze scores are so highly correlated with various
   factors of cognitive ability when the testing is in the subject's
   native language raises grave question as to the potential efficacy
   of the cloze procedure as a measure of the subject's achievement
   in a foreign language. (p. 66)

2. . . . results strongly suggest that the cloze tests are in fact
   measuring some important facet of foreign language proficiency -- but
   this is much more true for groups than for individuals. That is to
   say, if we use group results to cancel out individual variations on
   all the extraneous factors which may contribute to the determination
   of cloze test scores, the group means reflect real differences in
   foreign language competence. (p. 85)

3. Another kind of evidence of validity is to be found in the correlation
   of cloze test scores with teachers' grades. It is a common myth among
   educational psychologists that teachers' grades are notoriously
   unreliable; but this does not seem to be true, necessarily, of
   teachers' grades in foreign language courses, which are frequently
   found to correlate highly enough with other variables to suggest
   that they are quite reliable. (p. 8^)

Cloze procedure, in essence, tests for what word a reader would expect to read
(or hear) where the original word has been deleted in a paragraph. Rankin, who
critically evaluated Cloze procedure,[13] found highly stable correlations between
scores based upon production of the original word in each case and scores based upon
production of a word which "made sense" in each case. That was with native speakers
of English. In the study cited above, Carroll and others compared original-word
and "community of response" scoring, finding the latter slightly more reliable but
slightly less correlated with CEEB scores. Last month we scored our Cloze test two
ways -- on the basis of original word and on the basis of "makes sense" -- and found
no correlation for the foreign students tested.
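
The two scoring methods compared above can be made concrete with a small sketch. Everything in it is an assumption for illustration: the passage, the every-fifth-word deletion rule, the function names, and the list of words a scorer is supposed to have judged as "making sense" are hypothetical and are not drawn from the Penn State Cloze test.

```python
import string


def make_cloze(text, every=5):
    """Delete every nth word; return the mutilated text and the deleted words."""
    words = text.split()
    answers = {}
    for i in range(every - 1, len(words), every):
        core = words[i].strip(string.punctuation)
        if not core:
            continue
        answers[i] = core.lower()
        words[i] = words[i].replace(core, "_____", 1)
    return " ".join(words), answers


def score_cloze(answers, responses, acceptable=None):
    """Return (original-word score, makes-sense score) for one examinee."""
    acceptable = acceptable or {}
    exact = 0
    sense = 0
    for position, original in answers.items():
        given = responses.get(position, "").strip().lower()
        if given == original:
            exact += 1
            sense += 1
        elif given in acceptable.get(position, set()):
            sense += 1
    return exact, sense


PASSAGE = ("The student walked to the library because he needed "
           "a quiet place to study for the test.")

cloze_text, answers = make_cloze(PASSAGE)
print(cloze_text)

# Hypothetical examinee responses, keyed by word position in the passage.
responses = {4: "the", 9: "some", 14: "for"}
# Hypothetical scorer judgment of other words that make sense in each blank.
acceptable = {9: {"some", "one"}}

scores = score_cloze(answers, responses, acceptable)
print(scores)
```

The examinee who writes "some quiet place" loses a point under original-word scoring and keeps it under "makes sense" scoring. Mechanical deletion also happened to remove only function words in this passage, which bears on the form-class variation noted by Hopf and Spielmann.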

In a study of "Cloze procedure as a test of English language proficiency,"
Hopf and Spielmann found some variations attributable to the form classes of words
which happened to be deleted in two forms. This suggests that words for deletion
should be selected not by chance but for the testing purpose. In a new study
Spielmann is attempting to find ways to reduce variance attributable to outside
influences. He will include a hypothesis of greater validity of Cloze procedure
scores when corrected for each student's Cloze ability in his native language
(beginning with Spanish speakers).

[13] E. F. Rankin, Jr., "An Evaluation of the Cloze Procedure as a Technique
for Measuring Reading Comprehension," unpublished dissertation, U. of Michigan, 1957.

Spolsky is engaged in essentially the same line of research into redundancy
utilization, it seems to me, in the studies reported last year in his "Progress
report, January-September 1966."[14] He hypothesizes that "overall proficiency in a
language . . . may be measured by testing a subject's ability to send and receive
messages under varying conditions of distortion of the conducting medium." Where
our work with Cloze has begun with "distortion" of the written language, Spolsky
has begun with distortion of the spoken language (by introducing white noise at
discrete intensity levels).

Both Spolsky and Spielmann plan to apply their experimental tests to
languages other than English. They may not have the same comparisons in mind but,
in any case, should produce some data that may offer encouragement to those of us
who would assess that integrative factor of linguistic perceptual expectancy
(operationally defined, perhaps, as redundancy utilization).



[14] Bernard Spolsky, Preliminary Studies in the Development of Techniques for
Testing Overall Second Language Proficiency, Indiana University, 1966. (mimeograph)




